1 Import and first explorations

plot.igraph(
  misgraph, layout = layout_with_fr(misgraph), vertex.color = "red", vertex.frame.color = "yellow", vertex.shape = "circle", vertex.size = 3, vertex.label.font = 1, vertex.label.cex = 0.5, vertex.label.dist = 0.5, vertex.label.degree = -pi/2, vertex.label.color = "black", edge.color = "black", edge.curved = TRUE,edge.width = 0.2
,main = "Network Visualization of the data")

(a)

There are different ways for us to layout the graph. Layout randomly places the vertices uniform randomly on a plane. Layout in circle places the vertices uniformly on a circle, in the order of vertex ids. What we used here is forced-directed layout. The default function is called the Frunchterman-Reingold algorithm. Typically, force-directed algorithms use a physical simulation where some kind of attractive force (imagine a spring) are used to attract nodes connected by edges together. So ‘tightly’ connected clusters of nodes will show up close to each other, and those that are ‘loosely’ connected will be repulsed towards the outside. However, the algorithm does not specify where any node has to be other than these constraints.

gsize(misgraph)
## [1] 254
gorder(misgraph)
## [1] 77
edge_density(misgraph)
## [1] 0.08680793
diameter(misgraph, weights = NA)
## [1] 5

This is undirected graph. The size is 254, order is 77. The density of the graph is 0.08680793 and the diameter of the graph is 5. This graph is not complete since the number of edges is not equal to that of vertex*(vertex-1)/2.

set.seed(3)
V( misgraph )$label.cex <- ( degree ( misgraph )+10) / max ( degree ( misgraph ))
l <- layout_with_fr( misgraph )
plot.igraph(
  misgraph ,layout = layout_with_fr(misgraph), vertex.color = "red", vertex.frame.color = "yellow", vertex.shape = "circle", vertex.size = 3, vertex.label.font = 1, vertex.label.dist = 0.5, vertex.label.degree = -pi/2, vertex.label.color = "black", edge.color = "black", edge.curved = TRUE,edge.width = 0.2,main = " scaled version of the network based on the degree of each node"
)

This code is to adjust the label size based on the degree of each vertex so that it is easy to see who are the leading characters in this book and Jean Valjean has the most connections. Apart from that, from the vertex with relatively larger character names and some cluster, we can detect some communities by eyes.

2 Community detection

(a)

  1. Preparing the data
  2. Computing (dis)similarity information between every pair of objects in the data set.
  3. Using linkage function to group objects into hierarchical cluster tree, based on the distance information generated at step 1. Objects/clusters that are in close proximity are linked together using the linkage function.
  4. Determining where to cut the hierarchical tree into clusters. This creates a partition of the data. ### (b)
plot(mishclust, hang = -1, cex = 0.6, main = " HAC Cluster dendrogram with complete method", ylab = NULL, sub = NULL)

(c)

mod = c()
for (i in 1:10)
{
  labels = cutree(mishclust, i)
  mod[i] = modularity (x= misgraph , membership = labels )
}
plot (mod, type ="l",main =" Modularity for HAC")

(d)

labels = cutree(mishclust, 9)
V(misgraph)$color = labels

plot.igraph(
  misgraph, layout = layout_with_fr(misgraph),  vertex.frame.color = "white", vertex.shape = "circle", vertex.size = 3, vertex.label.font = 1,  vertex.label.dist = 0.5, vertex.label.degree = -pi/2, vertex.label.color = "black", edge.color = "black", edge.curved = TRUE,edge.width = 0.2, main = " HAC detection network"
)

From the plot, we think there are some problems with the community detection. Apparently, this story unfolded from Jean Valjean. However, some characters around Jean Valjean are detected in one community. Also, some characters on the margins who don’t have direct connection are detected into a community.

(e)

plot(mishclust,label=V(misgraph)$name, hang = -1, cex = 0.6, main = " HAC Cluster dendrogram after the number of communities are detected", ylab = NULL, sub = NULL)

(f)

mishclust <- hclust(d, method = "average")
plot(mishclust,label=V(misgraph)$name, hang = -1, cex = 0.6, main = "HAC Cluster dendrogram with average method", ylab = NULL, sub = NULL)

mod = c()
for (i in 1:10)
{
  labels = cutree(mishclust, i)
  mod[i] = modularity (x= misgraph , membership = labels )
}
plot (mod, type ="l",main =" Modularity for HAC")

labels = cutree(mishclust, 7)
V(misgraph)$color = labels

plot.igraph(
  misgraph, layout = layout_with_fr(misgraph),  vertex.frame.color = "white", vertex.shape = "circle", vertex.size = 3, vertex.label.font = 1,  vertex.label.dist = 0.5, vertex.label.degree = -pi/2, vertex.label.color = "black", edge.color = "black", edge.curved = TRUE,edge.width = 0.2
)

mishclust <- hclust(d, method = "single")
plot(mishclust,label=V(misgraph)$name, hang = -1, cex = 0.6, main = "Cluster dendrogram", ylab = NULL, sub = NULL)

mod = c()
for (i in 1:10)
{
  labels = cutree(mishclust, i)
  mod[i] = modularity (x= misgraph , membership = labels )
}
plot (mod, type ="l",main =" Modularity for HAC")

labels = cutree(mishclust, 9)
V(misgraph)$color = labels

plot.igraph(
  misgraph, layout = layout_with_fr(misgraph),  vertex.frame.color = "white", vertex.shape = "circle", vertex.size = 3, vertex.label.font = 1,  vertex.label.dist = 0.5, vertex.label.degree = -pi/2, vertex.label.color = "black", edge.color = "black", edge.curved = TRUE,edge.width = 0.2
)

After comparison, average gives better result. We give different number of communities for different methods. 7 is for average and 9 single. When we use single method, multiple single vertex are detected as a community alone, especially BishopCharles and there are a lot of connections around him, which is absurd. While using average methods, the clusters make more sense. There are more connections within one community. ## 2.2 Edge betweenness ### (a) Edge betweenness is a centrality measure for edges in a network. It can be used to partition the nodes in a network into communities because edges with high betweenness are usually bridges between densely connected clusters of nodes. Hence, we iteratively find and remove the edge with largest betweenness to divide a network into a hierarchy of nested communities.
( Betweenness centrality is a way of detecting the amount of influence a node has over the flow of information in a graph. It is often used to find nodes that serve as a bridge from one part of a graph to another.)

(b)

mis_edgeb = cluster_edge_betweenness(misgraph)
plot(as.dendrogram(mis_edgeb), main = "Dendrogram of the Edge Betweenness")

print(mis_edgeb)
## IGRAPH clustering edge betweenness, groups: 11, mod: 0.54
## + groups:
##   $`1`
##    [1] "Gillenormand"                   "Toussaint"                     
##    [3] "Cosette"                        "MademoiselleGillenormand"      
##    [5] "LieutenantTheoduleGillenormand" "BaronessDeThenard"             
##    [7] "Woman2"                         "MadamePontmercy"               
##    [9] "Magnon"                         "MademoiselleVaubois"           
##   
##   $`2`
##    [1] "Zephine"        "Dahlia"         "Fantine"        "Blacheville"   
##    [5] "Fameuil"        "Favourite"      "Listolier"      "SisterPerpetue"
##   + ... omitted several groups/vertices

(c)

We observed that the edges between communities are cut off. It is easier for us to see the reslut. The code is to delete the edge with high betweenness value and it stops when it reach the the number of communities with highest modularity value.

From the plot, we can see that this code cuts the edges connecting the communities determined by the edge betweenness. This makes the different communities disconnected in the graph.

2.3 Spectral clustering and the Louvain algorithm

The Louvain algorithm

## IGRAPH clustering multi level, groups: 6, mod: 0.56
## + groups:
##   $`1`
##    [1] "Gillenormand"                   "Marius"                        
##    [3] "Pontmercy"                      "Cosette"                       
##    [5] "MademoiselleGillenormand"       "LieutenantTheoduleGillenormand"
##    [7] "BaronessDeThenard"              "MadamePontmercy"               
##    [9] "Magnon"                         "MademoiselleVaubois"           
##   
##   $`2`
##    [1] "Zephine"        "Dahlia"         "Fantine"        "Blacheville"   
##    [5] "Fameuil"        "SisterSimplice" "Favourite"      "Listolier"     
##   + ... omitted several groups/vertices

Six communities are detected using Louvain algorithm.

Leading eigen algorithm

## IGRAPH clustering leading eigenvector, groups: 8, mod: 0.53
## + groups:
##   $`1`
##    [1] "Gillenormand"             "Scaufflaire"             
##    [3] "Pontmercy"                "Toussaint"               
##    [5] "Cosette"                  "MaubertIsabeau"          
##    [7] "Fauchelevent"             "Gervais"                 
##    [9] "MademoiselleGillenormand" "MotherInnocent"          
##   [11] "JeanValjean"              "Woman2"                  
##   [13] "MadamePontmercy"          "Woman1"                  
##   [15] "BaptistineMyrie"          "Magnon"                  
##   [17] "Gribier"                  "MadameMagloire"          
##   + ... omitted several groups/vertices

Eight communities are detected using eigen clustering.

2.4 Conclusion

Overall, we concluded that edge betweenness algorithm outperforms dramatically in the sense of detecting the number communities, because from the plots we can see that the overlap is minimized. However, we lose the connection among the communities. In the louvain algorithm, we can still detect how many communities there are easily ( but not as easily as in the edge betweenness) and see the relations between each community. Furthermore, in the leading Eigen clustering we have obvious overlapping between clusters.

As for the outliers, we can not see any outliers in Louvain algorithm, which is its drawback in case of anamoly detection. However, the outliers are visible in the edge betweenness. But in the Leading Eigen clustering, even though the outliers can be detected , the density of the overlapp is too much to be noticed. Moreover, HAC has the worst performance.